11 result(s)
2017 Conference article Restricted
Faster blockmaxwand with variable-sized blocks
Mallia A., Ottaviano G., Porciani E., Tonellotto N., Venturini R.
Query processing is one of the main bottlenecks in large-scale search engines. Retrieving the top k most relevant documents for a given query can be extremely expensive, as it involves scoring large amounts of documents. Several dynamic pruning techniques have been introduced in the literature to tackle this problem, such as BlockMaxWAND, which splits the inverted index into constant-sized blocks and stores the maximum document-term scores per block; this information can be used during query execution to safely skip low-score documents, producing many-fold speedups over exhaustive methods. We introduce a refinement for BlockMaxWAND that uses variable-sized blocks, rather than constant-sized. We set up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describe an efficient algorithm to find an approximate solution, with provable approximation guarantees. Through an extensive experimental analysis we show that our method significantly outperforms the state of the art, roughly by a factor of 2. We also introduce a compressed data structure to represent the additional block information, providing a compression ratio of roughly 50% while incurring only a small speed degradation, no more than 10% with respect to its uncompressed counterpart.
Source: SIGIR '17 - 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 625–634, Shinjuku, Tokyo, Japan, 7-11 July 2017
DOI: 10.1145/3077136.3080780
See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
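To make the skipping idea above concrete, here is a toy Python sketch (hypothetical names, plain lists instead of a real inverted index; not the paper's implementation) of how per-block maximum scores allow safe skips during scoring:

```python
# Toy sketch of block-max skipping. Each block of a posting list carries
# the maximum document-term score among its postings; any block whose
# maximum cannot beat the current top-k threshold is skipped without
# scoring its documents.

def block_max_scan(blocks, threshold):
    """blocks: list of (max_score, postings), postings = [(doc_id, score)].
    Returns the postings that can still enter the top-k."""
    survivors = []
    for max_score, postings in blocks:
        if max_score <= threshold:
            continue  # safe skip: no posting here can exceed the threshold
        for doc_id, score in postings:
            if score > threshold:
                survivors.append((doc_id, score))
    return survivors

blocks = [
    (0.9, [(1, 0.9), (4, 0.2)]),
    (0.3, [(7, 0.3), (9, 0.1)]),   # skipped once the threshold reaches 0.3
    (0.7, [(12, 0.7), (15, 0.5)]),
]
print(block_max_scan(blocks, 0.4))  # [(1, 0.9), (12, 0.7), (15, 0.5)]
```

With variable-sized blocks, the partition can be chosen so that each block's maximum tracks its underlying scores tightly, which is exactly the optimization the paper formulates.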


2015 Conference article Open Access
Optimal space-time tradeoffs for inverted indexes
Ottaviano G., Tonellotto N., Venturini R.
Inverted indexes are usually represented by dividing posting lists into constant-sized blocks and representing them with an encoder for sequences of integers. Different encoders yield different points in the space-time trade-off curve, with the fastest being several times larger than the most space-efficient. An important design decision for an index is thus the choice of the fastest encoding method such that the index fits in the available memory. However, a better usage of the space budget could be obtained by using faster encoders for frequently accessed blocks, and more space-efficient ones for those that are rarely accessed. To perform this choice optimally, we introduce a linear-time algorithm that, given a query distribution and a set of encoders, selects the best encoder for each index block to obtain the lowest expected query processing time while respecting a given space constraint. To demonstrate the effectiveness of this approach we perform an extensive experimental analysis, which shows that our algorithm produces indexes which are significantly faster than single-encoder indexes under several query processing strategies, while respecting the same space constraints.
Source: 8th International Conference on Web Search and Data Mining (WSDM 2015), pp. 47–56, Shanghai, China, 31/01/2015-06/02/2015
DOI: 10.1145/2684822.2685297
See at: www.di.unipi.it Open Access | dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
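The trade-off the paper optimizes can be illustrated with a toy greedy heuristic (hypothetical names and numbers; the paper gives a linear-time algorithm with optimality guarantees, not this greedy): start from the most space-efficient encoder everywhere, then spend the remaining budget where the expected decode-time saving per extra byte is largest.

```python
# Toy sketch of per-block encoder selection under a space budget.
# Each encoder option is (space_per_block, decode_time_per_block); each
# block has an access probability taken from the query distribution.

def assign_encoders(block_freqs, encoders, budget):
    small = min(encoders, key=lambda e: e[0])      # most space-efficient
    choice = [small] * len(block_freqs)
    used = small[0] * len(block_freqs)
    # candidate upgrades, ranked by expected time saved per extra byte
    cands = []
    for b, f in enumerate(block_freqs):
        for enc in encoders:
            ds, dt = enc[0] - small[0], small[1] - enc[1]
            if ds > 0 and dt > 0:
                cands.append((f * dt / ds, b, enc))
    for _, b, enc in sorted(cands, reverse=True):
        extra = enc[0] - choice[b][0]
        if extra > 0 and used + extra <= budget:
            choice[b] = enc
            used += extra
    return choice, used

# The hot block (accessed 80% of the time) gets the fast, larger encoder.
print(assign_encoders([0.8, 0.2], [(4, 10), (8, 2)], budget=12))
# ([(8, 2), (4, 10)], 12)
```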


2015 Conference article Restricted
Fast and space-efficient entity linking in queries
Blanco R., Ottaviano G., Meij E.
Entity linking deals with identifying entities from a knowledge base in a given piece of text and has become a fundamental building block for web search engines, enabling numerous downstream improvements from better document ranking to enhanced search results pages. A key problem in the context of web search queries is that this process needs to run under severe time constraints, as it has to be performed before any actual retrieval takes place, typically within milliseconds. In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base. There are three key ingredients that make the algorithm fast and space-efficient. First, the linking process ignores any dependencies between the different entity candidates, which allows for an O(k²) implementation in the number of query terms. Second, we leverage hashing and compression techniques to reduce the memory footprint. Finally, to equip the algorithm with contextual knowledge without sacrificing speed, we factor the distance between distributional semantics of the query words and entities into the model. We show that our solution significantly outperforms several state-of-the-art baselines by more than 14% while being able to process queries in sub-millisecond times, at least two orders of magnitude faster than existing systems.
Source: WSDM '15 - Eighth ACM International Conference on Web Search and Data Mining, pp. 179–188, Shanghai, People's Republic of China, 31 January - 6 February 2015
DOI: 10.1145/2684822.2685317
See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA


2014 Conference article Open Access
Cache-oblivious peeling of random hypergraphs
Belazzougui D., Boldi P., Ottaviano G., Venturini R., Vigna S.
The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random r-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear-time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
Source: DCC 2014 - Data Compression Conference, pp. 352–361, Snowbird, Utah, USA, 26-28 March 2014
DOI: 10.1109/dcc.2014.48
DOI: 10.48550/arxiv.1312.0526
Project(s): MIDAS via OpenAIRE, NADINE via OpenAIRE
See at: arXiv.org e-Print Archive Open Access | arxiv.org Open Access | doi.org Restricted | doi.org Restricted | CNR ExploRA
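The peeling step itself is simple; it is its I/O behavior that the paper improves. A minimal in-memory sketch (hypothetical names) of the straightforward algorithm the paper takes as its starting point:

```python
# Sketch of straightforward in-memory hypergraph peeling: repeatedly find
# a vertex of degree 1 and peel the unique live edge containing it. If all
# edges get peeled, the hypergraph is peelable (e.g. an MPHF construction
# can succeed); otherwise a non-empty 2-core remains.

from collections import defaultdict

def peel(edges):
    incidence = defaultdict(set)          # vertex -> indices of live edges
    for i, e in enumerate(edges):
        for v in e:
            incidence[v].add(i)
    stack = [v for v in incidence if len(incidence[v]) == 1]
    order, alive = [], [True] * len(edges)
    while stack:
        v = stack.pop()
        if len(incidence[v]) != 1:
            continue
        (i,) = incidence[v]
        if not alive[i]:
            continue
        alive[i] = False
        order.append(i)
        for u in edges[i]:                # removing the edge may expose
            incidence[u].discard(i)       # new degree-1 vertices
            if len(incidence[u]) == 1:
                stack.append(u)
    return order if not any(alive) else None   # None: a 2-core remains

# A peelable 3-uniform hypergraph on vertices 0..5
print(peel([(0, 1, 2), (1, 2, 3), (3, 4, 5)]))  # [2, 1, 0]
```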


2014 Conference article Restricted
A SOA testing platform on the Cloud: the MIDAS experience
De Francesco A., Di Napoli C., Giordano M., Ottaviano G., Perego R., Tonellotto N.
The widespread use of SOA-based applications is posing new challenges to software testing, since conventional methodologies are inadequate to cover the specific testing requirements of these applications. In this context, the European FP7 Project MIDAS aims at building an integrated platform for SOA testing automation, designed and architected according to the SOA computing paradigm. In order to address the variability in the computational resources necessary for testing SOA applications of different complexity coming from different users, the MIDAS platform is designed as an integrated Testing as a Service (TaaS) framework deployed on a public Cloud infrastructure, available on a self-provisioning, pay-per-use, elastic basis. This paper introduces and discusses the Cloud-based software architecture envisioned in the MIDAS project by detailing the strategies adopted for its deployment on the target Cloud infrastructure (Amazon AWS).
Source: INCoS 2014 - International Conference on Intelligent Networking and Collaborative Systems, pp. 659–664, Salerno, Italy, 10-12 September 2014
DOI: 10.1109/incos.2014.62
Project(s): MIDAS via OpenAIRE
See at: doi.org Restricted | CNR ExploRA


2014 Journal article Open Access
Fast compressed tries through path decompositions
Grossi R., Ottaviano G.
Tries are popular data structures for storing a set of strings, where common prefixes are represented by common root-to-node paths. More than 50 years of usage have produced many variants and implementations to overcome some of their limitations. We explore new succinct representations of path-decomposed tries and experimentally evaluate the corresponding reduction in space usage and memory latency, comparing with the state of the art. We study the following applications: compressed string dictionaries and monotone minimal perfect hash for strings. For compressed string dictionaries, we obtain data structures that outperform other state-of-the-art compressed dictionaries in space efficiency while obtaining predictable query times that are competitive with data structures preferred by practitioners. On real-world datasets, our compressed tries obtain the smallest space (except for one case) and have the fastest lookup times, whereas access times are at most 20% slower than the best-known solutions. For monotone minimal perfect hash for strings, our compressed tries perform several times faster than other trie-based monotone perfect hash functions while occupying nearly the same space. On real-world datasets, our tries are approximately 2 to 5 times faster than previous solutions, with a space occupancy less than 10% larger.
Source: ACM Journal of Experimental Algorithmics 19 (2014). doi:10.1145/2656332
DOI: 10.1145/2656332
DOI: 10.48550/arxiv.1111.5220
See at: arXiv.org e-Print Archive Open Access | Journal of Experimental Algorithmics Open Access | Journal of Experimental Algorithmics Restricted | doi.org Restricted | www.scopus.com Restricted | CNR ExploRA
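As a toy illustration of path decomposition (a leftmost-path variant on uncompressed dicts, with hypothetical names; the paper uses succinct encodings and more careful path choices), a trie can be collapsed into labelled paths like this:

```python
# Sketch of trie path decomposition: pick a root-to-leaf path, collapse it
# into one node labelled with the whole path string, and recurse on the
# subtries hanging off it. The resulting tree's height is bounded by the
# number of branching points on a path rather than the string length.

def path_decompose(trie):
    """trie: nested dict mapping chars to subtries ({} = leaf).
    Returns (path_label, branches), where each branch records the label
    position it departs from and the decomposition of its subtrie."""
    label, branches, node = "", [], trie
    while node:
        ch = min(node)        # leftmost child; a real implementation picks smarter
        for other in sorted(node):
            if other != ch:
                branches.append((len(label), path_decompose({other: node[other]})))
        label += ch
        node = node[ch]
    return label, branches

# Trie for {"to", "tea", "ten"}: the path "tea" becomes one node,
# with "to" branching at position 1 and "ten" at position 2.
print(path_decompose({'t': {'o': {}, 'e': {'a': {}, 'n': {}}}}))
# ('tea', [(1, ('o', [])), (2, ('n', []))])
```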


2014 Conference article Open Access
Partitioned Elias-Fano indexes
Ottaviano G., Venturini R.
The Elias-Fano representation of monotone sequences has recently been applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is competitive with some state-of-the-art methods such as γ-δ-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we describe a new representation based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure. This partitioning enables the encoding to better adapt to the local statistics of each chunk, thus exploiting clustering and improving compression. We present two partition strategies, with fixed-length and variable-length chunks respectively. For the latter case we introduce a linear-time optimization algorithm which identifies the minimum-space partition up to an arbitrarily small approximation factor. We show that our partitioned Elias-Fano indexes offer significantly better compression than plain Elias-Fano, while preserving their query time efficiency. Furthermore, compared with other state-of-the-art compressed encodings, our indexes exhibit the best compression ratio/query time trade-off.
Source: SIGIR 2014 - 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273–282, Gold Coast, QLD, Australia, 6-11 July 2014
DOI: 10.1145/2600428.2609615
See at: www.di.unipi.it Open Access | dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
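A minimal Python sketch of plain (non-partitioned) Elias-Fano, with hypothetical names: each of the n monotone values in [0, u) splits into l = ⌊log₂(u/n)⌋ explicit low bits plus a unary-coded high part. The paper's contribution, not shown here, is to apply this encoding at two levels, to the chunks and to their endpoints.

```python
# Sketch of plain Elias-Fano for a monotone sequence of n values in [0, u).
from math import floor, log2

def ef_encode(values, u):
    n = len(values)
    l = max(0, floor(log2(u / n)))               # number of explicit low bits
    low = [v & ((1 << l) - 1) for v in values]
    high = [0] * (n + (u >> l) + 1)              # unary-coded upper parts
    for i, v in enumerate(values):
        high[(v >> l) + i] = 1                   # position of the i-th one
    return low, high, l

def ef_decode(low, high, l):
    out, i = [], 0
    for pos, bit in enumerate(high):
        if bit:                                  # i-th one sits at pos = (v >> l) + i
            out.append(((pos - i) << l) | low[i])
            i += 1
    return out

seq = [3, 4, 7, 13, 14, 15, 21, 43]
low, high, l = ef_encode(seq, u=44)
print(ef_decode(low, high, l) == seq)  # True
```

The total space is n·l bits for the low parts plus at most 2n bits for the unary part, close to the information-theoretic minimum for a monotone sequence.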


2013 Report Unknown
Cache-oblivious peeling of random hypergraphs
Belazzougui D., Boldi P., Ottaviano G., Venturini R., Vigna S.
The computation of a peeling order in a randomly generated hypergraph is the most time- consuming step in a number of constructions, such as perfect hashing schemes, random r-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges. We experimentally evaluate the performance of our implementation of this algorithm in a real- world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds a MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.Source: ISTI Technical reports, 2013

See at: CNR ExploRA


2013 Conference article Open Access
Design of practical succinct data structures for large data collections
Grossi R., Ottaviano G.
We describe a set of basic succinct data structures which have been implemented as part of the Succinct library, and applications on top of the library: an index to speed up access to collections of semi-structured data, a compressed string dictionary, and a compressed dictionary for scored strings which supports top-k prefix matching.
Source: SEA 2013 - Experimental Algorithms, 12th International Symposium, pp. 5–17, Rome, Italy, 5-7 June 2013
DOI: 10.1007/978-3-642-38527-8_3
See at: link.springer.com Open Access | www.di.unipi.it Open Access | doi.org Restricted | CNR ExploRA
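As a flavor of the basic building blocks such a library provides, here is a toy one-level rank structure (hypothetical names; real implementations use two-level samples and machine popcounts for O(1) queries):

```python
# Toy rank support for a bitvector: absolute 1-counts sampled every B bits,
# plus a scan of the remainder, answer rank1(pos) = number of 1s in
# bits[0:pos] in O(B) time and o(n) extra space.

B = 64  # sample rate; illustrative only

def build_rank(bits):
    samples, acc = [0], 0
    for i, b in enumerate(bits, 1):
        acc += b
        if i % B == 0:
            samples.append(acc)          # 1-count of the first i bits
    return samples

def rank1(bits, samples, pos):
    """Number of 1s in bits[0:pos]."""
    return samples[pos // B] + sum(bits[pos - pos % B : pos])

bits = [1, 0, 1, 1] * 40                 # 160 bits, 3 ones per group of 4
samples = build_rank(bits)
print(rank1(bits, samples, 100))  # 75
```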


2013 Conference article Open Access
Compressible motion fields
Ottaviano G., Kohli P.
Traditional video compression methods obtain a compact representation for image frames by computing coarse motion fields defined on patches of pixels called blocks, in order to compensate for the motion in the scene across frames. This piecewise-constant approximation makes the motion field efficiently encodable, but it introduces block artifacts in the warped image frame. In this paper, we address the problem of estimating dense motion fields that, while accurately predicting one frame from a given reference frame by warping it with the field, are also compressible. We introduce a representation for motion fields based on wavelet bases, and approximate the compressibility of their coefficients with a piecewise-smooth surrogate function that yields an objective function similar to classical optical flow formulations. We then show how to quantize and encode such coefficients with adaptive precision. We demonstrate the effectiveness of our approach by comparing its performance with a state-of-the-art wavelet video encoder. Experimental results on a number of standard flow and video datasets reveal that our method significantly outperforms both block-based and optical-flow-based motion compensation algorithms.
Source: CVPR 2013 - IEEE Conference on Computer Vision and Pattern Recognition, pp. 2251–2258, Portland, USA, 23-28 June 2013
DOI: 10.1109/cvpr.2013.292
See at: doi.org Restricted | CNR ExploRA


2013 Conference article Open Access
Space-efficient data structures for top-k completion
Hsu P., Ottaviano G.
Virtually every modern search application, whether desktop, web, or mobile, features some kind of query auto-completion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable coverage; and for mobile devices, where the amount of memory is limited. We present three different trie-based data structures to address this problem, each one with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
Source: WWW 2013 - 22nd International Conference on World Wide Web, pp. 583–594, Rio de Janeiro, Brazil, 13-17 May 2013

See at: dl.acm.org Open Access | CNR ExploRA
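The retrieval problem can be sketched with a toy uncompressed trie where each node caches the best score in its subtree, enabling best-first top-k search (hypothetical names, nonnegative scores assumed; the paper's contribution is supporting this in compressed space):

```python
# Toy score-aware top-k completion: each trie node stores the best score
# below it, so a best-first search over a max-heap visits only promising
# branches and emits completions in decreasing score order.
import heapq

def build_trie(scored_strings):
    root = {"best": -1, "kids": {}, "score": None}
    for s, sc in scored_strings:
        node = root
        node["best"] = max(node["best"], sc)
        for ch in s:
            node = node["kids"].setdefault(ch, {"best": -1, "kids": {}, "score": None})
            node["best"] = max(node["best"], sc)
        node["score"] = sc
    return root

def top_k(root, prefix, k):
    node = root
    for ch in prefix:                       # walk down to the prefix node
        node = node["kids"].get(ch)
        if node is None:
            return []
    heap = [(-node["best"], prefix, node)]
    out = []
    while heap and len(out) < k:
        neg, s, node = heapq.heappop(heap)
        if node is None:                    # a finished completion
            out.append((s, -neg))
            continue
        if node["score"] is not None:
            heapq.heappush(heap, (-node["score"], s, None))
        for ch, child in node["kids"].items():
            heapq.heappush(heap, (-child["best"], s + ch, child))
    return out

trie = build_trie([("face", 5), ("fact", 9), ("facet", 2), ("fab", 7), ("go", 8)])
print(top_k(trie, "fac", 2))  # [('fact', 9), ('face', 5)]
```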